Abstract

The paper deals with on-line regression settings with signals belonging 
to a Banach lattice. Our algorithms work in a semi-online setting where all 
the inputs are known in advance and outcomes are unknown and given step 
by step. We apply the Aggregating Algorithm to construct a prediction 
method whose cumulative loss over all the input vectors is comparable 
with the cumulative loss of any linear functional on the Banach lattice. 
As a by-product we get an algorithm that takes signals from an arbitrary 
domain. Its cumulative loss is comparable with the cumulative loss of any 
predictor function from Besov and Triebel-Lizorkin spaces. We describe 
several applications of our setting. 

1 Introduction 

In this paper we consider an online regression task. A sequence of outcomes 
is predicted step by step. In the beginning of each step we are given a signal 
related to an outcome. After we make our prediction, the true outcome is 
announced. We are interested in learning the relationship between signals and
their outcomes. In a simple case each signal is an input vector of some
variables and this relationship is assumed to be linear; linear regression
minimising the expected loss is studied in statistics. We assess the quality of
predictions by means of the loss accumulated over several trials. This loss is
compared against the loss of predictors from some benchmark class. In this
paper we prove upper bounds for the cumulative losses of our algorithms in the
form

$$L_T \le L_T(\theta) + R(T, \theta), \qquad (1)$$

where $T$ is the number of prediction steps, $L_T$ is the cumulative loss of
an algorithm over $T$ steps, $L_T(\theta)$ is the loss of any predictor $\theta$ from a
chosen benchmark class over $T$ steps, and $R(T, \theta)$ is an additional term
called a regret term. We say that our algorithm competes with any function from
a chosen benchmark class if $R(T, \theta)$ grows sublinearly in $T$.

The case when the signal is an input vector of some variables is well studied in
computer learning. Many algorithms (see Cesa-Bianchi and Lugosi, 2006, Chapter
10) compete with the benchmark class of linear functions of input vectors
$\mathbb{R}^n \to \mathbb{R}$. Some of them can be generalized to compete with
the benchmark class of all functions from a Reproducing Kernel Hilbert space
(Gammerman et al., 2004; Kivinen and Warmuth, 2004; Vovk, 2006b).



The novelty of this paper is in the expansion of the class of signals to
signals from abstract normed vector spaces. In this paper we consider Banach
lattices, which are Banach spaces with some additional structural assumptions.
The performance of our algorithm is compared with the performance of any
vector from a dual lattice (that is, with a linear predictor on the signal). We
show that this framework can be useful when signals are digital images or
sounds.

On the one hand, using the example of AAR (Vovk, 2001), we show that algorithms
developed to compete with linear functions of a vector input can be slightly
modified to work in our framework. On the other hand, this surprising result
comes with the assumption that all the input signals are known in advance.
We call this the semi-online setting. We show that in some applications of
online regression the semi-online setting is not a drawback.

We modify our algorithm to be able to work with finite-dimensional input
vectors from a domain of $\mathbb{R}^m$ and the benchmark class of functions
belonging to a Sobolev space. This may give a wide spectrum of applications, for
example prediction of Brownian motion (which almost surely can be said to belong
to a Sobolev space, see Vovk, 2007).

The paper is organized as follows. In Section 2 we give proofs of the
theoretical bounds for the performance of an algorithm taking finite-dimensional
input vectors. The benchmark class of predictors is a class of linear functions
of input vectors having non-Euclidean norms. Section 3 describes the proof of
the main theoretical bounds for an algorithm working with Banach lattices. In
Section 4 we describe several applications of our algorithms. Section 5
discusses some open problems. We include some complicated proofs and algorithms
in the Appendix.



2 Competing with different norms 

A game of prediction contains three components: a space of outcomes $\Omega$, a
decision space $\Gamma$, and a loss function
$\lambda : \Omega \times \Gamma \to \mathbb{R}$. We are interested in
the square-loss game with $\Omega = [-Y, Y]$, $Y > 0$, $\Gamma = \mathbb{R}$,
and the loss function $\lambda(y, \gamma) = (y - \gamma)^2$, $y \in \Omega$,
$\gamma \in \Gamma$. The game of prediction is played
repeatedly by a learner receiving some signals $x_t$ from a linear space $S$, and
follows the prediction protocol:

Protocol 1 Online regression

  $L_0 := 0$.
  for $t = 1, 2, \ldots$ do
    Reality announces a signal $x_t \in S$.
    Learner announces $\gamma_t \in \Gamma$.
    Reality announces $y_t \in \Omega$.
    $L_t := L_{t-1} + \lambda(y_t, \gamma_t)$.
  end for



Here $L_t$ is the cumulative loss of the learner. We are interested in obtaining
upper bounds on the loss of the learner in the form (1) for any
$\theta \in S^*$. The quality of the work of the learner can be measured by the
order of growth of the regret term in $T$.

We use the prediction method called the Aggregating Algorithm for Regression
(AAR), developed initially (Vovk, 2001) for the case $S = \mathbb{R}^n$. It
takes a parameter $a > 0$ and gives its prediction of an outcome at a step $T$
by the formula

$$\gamma_T = \left( \sum_{t=1}^{T-1} y_t x_t \right)' \left( aI + \sum_{t=1}^{T} x_t x_t' \right)^{-1} x_T.$$

Here $I$ is the $n \times n$ identity matrix. It performs as well as any linear
predictor (given an input $x$ (column vector), a predictor $\theta$ predicts
$\theta' x$). It is known that



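Theorem (Vovk, 2001). For all $a > 0$, all positive integers $T$, all input
vectors $x_1, x_2, \ldots, x_T \in \mathbb{R}^n$ such that
$\|x_t\|_\infty \le X$, $t = 1, 2, \ldots, T$, and all
$\theta \in \mathbb{R}^n$, the loss of AAR satisfies

$$L_T(\mathrm{AAR}) \le L_T(\theta) + a\|\theta\|_2^2 + nY^2 \ln\left( \frac{TX^2}{a} + 1 \right). \qquad (2)$$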

The regret term in this bound has a logarithmic order of growth in $T$ but it
is linear in $n$. Therefore it is applicable for the case of small dimension $n$
and large $T$. We shall now prove an upper bound that grows slowly in $n$ and
depends on non-Euclidean norms in $\mathbb{R}^n$. We use the norm equivalence
constants.
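
The AAR prediction rule described above can be sketched in a few lines. This is
only an illustrative sketch, assuming the usual form of the rule (the vector of
past outcome-weighted inputs multiplied by the inverse of $aI$ plus the sum of
outer products of all inputs seen so far, including the current one). The
helper names `solve` and `aar_predict` are hypothetical and not taken from the
paper.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def aar_predict(past_x, past_y, x_new, a):
    """Assumed AAR rule: b' (aI + sum_t x_t x_t')^{-1} x_new.

    The sum of outer products runs over the past inputs and x_new;
    b is the sum of y_t * x_t over the past steps only.
    """
    n = len(x_new)
    A = [[a if i == j else 0.0 for j in range(n)] for i in range(n)]
    for x in list(past_x) + [x_new]:
        for i in range(n):
            for j in range(n):
                A[i][j] += x[i] * x[j]
    b = [sum(y * x[i] for x, y in zip(past_x, past_y)) for i in range(n)]
    z = solve(A, x_new)  # z = A^{-1} x_new; A is symmetric positive definite
    return sum(b[i] * z[i] for i in range(n))
```

With no past data the prediction is zero, and the parameter `a` acts as a
ridge-style regularizer that keeps the matrix invertible.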

Lemma 1. Let $a \in \mathbb{R}^n$, $1 \le p \le 2$, and $1/p + 1/q = 1$. Then

$$\|a\|_2 \le \|a\|_p,$$
$$\|a\|_2 \le n^{1/2 - 1/q} \|a\|_q.$$

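Proof. The first inequality follows from the fact that the function
$f(p) = \|a\|_p$ is decreasing in $p$. Indeed, for $p > 0$ and $a \ne 0$,

$$\frac{df}{dp} = \frac{\|a\|_p}{p^2} \sum_{i=1}^n w_i \ln w_i \le 0, \qquad w_i = \frac{|a_i|^p}{\sum_{j=1}^n |a_j|^p} \in [0, 1],$$

with the convention $0 \ln 0 = 0$.

To prove the second inequality we consider the Hölder inequality for
$x, y \in \mathbb{R}^n$ and $b > 1$ (see Beckenbach and Bellman, 1961, p. 21):

$$\sum_{i=1}^n |x_i y_i| \le \left( \sum_{i=1}^n |x_i|^b \right)^{1/b} \left( \sum_{i=1}^n |y_i|^c \right)^{1/c}$$

for $1/b + 1/c = 1$. This implies

$$\|a\|_2^2 = \sum_{i=1}^n |a_i|^2 \le \left( \sum_{i=1}^n (|a_i|^2)^{q/2} \right)^{2/q} \left( \sum_{i=1}^n 1^c \right)^{1/c}$$

for $b = q/2 > 1$ and $c = q/(q-2)$ (the case $q = 2$ is trivial). Therefore
$\|a\|_2 \le n^{1/2 - 1/q} \|a\|_q$. □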
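
The two inequalities of Lemma 1 are easy to check numerically. The sketch below
assumes they read $\|a\|_2 \le \|a\|_p$ and
$\|a\|_2 \le n^{1/2-1/q}\|a\|_q$ for conjugate exponents with $1 < p \le 2$;
the function names are hypothetical, and a small tolerance absorbs
floating-point error in the equality cases.

```python
import random


def p_norm(a, p):
    """The l_p norm of a finite real vector."""
    return sum(abs(v) ** p for v in a) ** (1.0 / p)


def check_lemma1(a, p):
    """Return True if both assumed inequalities hold for a and 1 < p <= 2."""
    n = len(a)
    q = p / (p - 1.0)  # conjugate exponent: 1/p + 1/q = 1
    two = p_norm(a, 2)
    tol = 1e-9 * (1.0 + two)
    first = two <= p_norm(a, p) + tol
    second = two <= n ** (0.5 - 1.0 / q) * p_norm(a, q) + tol
    return first and second
```

The constant vector $(1, \ldots, 1)$ turns the second inequality into an
equality, which shows that the factor $n^{1/2-1/q}$ cannot be improved.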

We denote the space of $n$-dimensional real vectors $x = (x^1, \ldots, x^n)$
equipped with the $q$-norm $\|x\|_q = (\sum_{i=1}^n |x^i|^q)^{1/q}$ by
$\ell_q^n$, $q \ge 1$. Let $p$ be such that $1/p + 1/q = 1$.

Lemma 2. For each positive integer $T$ and any real positive $Y, X$ there is
a constant $a > 0$ such that for any sequence
$(x_1, y_1), \ldots, (x_T, y_T)$ such that $\|x_t\|_q \le X$, $|y_t| \le Y$,
$t = 1, 2, \ldots, T$, and all $\theta \in \ell_p^n$ the loss of AAR with the
parameter $a$ satisfies

$$L_T(\mathrm{AAR}) \le L_T(\theta) + (Y^2X^2 + \|\theta\|_p^2) T^{1/2} n^{|1/2 - 1/q|}. \qquad (3)$$

Proof. Following (Vovk, 2006b, Theorem 3) we get

$$L_T(\mathrm{AAR}) \le L_T(\theta) + a\|\theta\|_2^2 + \frac{Y^2 T \max_{t=1,\ldots,T} \|x_t\|_2^2}{a}.$$

If $q \ge 2$, then by Lemma 1 $\|x_t\|_2^2 \le n^{1-2/q} X^2$ and
$\|\theta\|_2^2 \le \|\theta\|_p^2$. This leads to the regret term
$a\|\theta\|_p^2 + Y^2 T n^{1-2/q} X^2 / a$. By choosing
$a = \sqrt{T n^{1-2/q}}$ we obtain the regret term
$(Y^2X^2 + \|\theta\|_p^2) T^{1/2} n^{1/2-1/q}$.

If $1 < q < 2$, then by Lemma 1 $\|x_t\|_2^2 \le X^2$ and
$\|\theta\|_2^2 \le n^{1-2/p} \|\theta\|_p^2$. This leads to the regret term
$a n^{1-2/p} \|\theta\|_p^2 + Y^2 T X^2 / a$. For the same
$a = \sqrt{T n^{1-2/q}} = \sqrt{T n^{2/p-1}}$ we obtain the regret term
$(Y^2X^2 + \|\theta\|_p^2) T^{1/2} n^{1/2-1/p}$. □
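
The balancing of the two regret terms in this proof can be checked
numerically. The sketch below assumes the two cases of the proof as
$a\|\theta\|_p^2 + Y^2 T n^{1-2/q} X^2 / a$ for $q \ge 2$ and
$a n^{1-2/p}\|\theta\|_p^2 + Y^2 T X^2 / a$ for $1 < q < 2$, with
$a = \sqrt{T n^{1-2/q}}$ in both. The function names are hypothetical.

```python
import math


def lemma2_regret(T, n, q, X, Y, theta_norm_p):
    """Return (a, regret) for the choice a = sqrt(T * n**(1 - 2/q))."""
    p = q / (q - 1.0)  # conjugate exponent, requires q > 1
    a = math.sqrt(T * n ** (1.0 - 2.0 / q))
    if q >= 2:
        regret = (a * theta_norm_p ** 2
                  + Y ** 2 * T * n ** (1.0 - 2.0 / q) * X ** 2 / a)
    else:
        regret = (a * n ** (1.0 - 2.0 / p) * theta_norm_p ** 2
                  + Y ** 2 * T * X ** 2 / a)
    return a, regret


def closed_form(T, n, q, X, Y, theta_norm_p):
    """(Y^2 X^2 + ||theta||_p^2) * T^(1/2) * n^|1/2 - 1/q|."""
    return ((Y ** 2 * X ** 2 + theta_norm_p ** 2)
            * math.sqrt(T) * n ** abs(0.5 - 1.0 / q))
```

Both branches collapse to the same closed form because
$1/2 - 1/p = 1/q - 1/2$ for conjugate exponents, which is why a single
exponent $|1/2 - 1/q|$ covers both cases.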

Remark. We can deduce another bound from (2). We have
$\|x\|_\infty \le \|x\|_q$ for any $q \ge 1$. Since
$\|\theta\|_2 \le \|\theta\|_p$ if $p \le 2$, the upper bound becomes

$$L_T(\mathrm{AAR}) \le L_T(\theta) + a\|\theta\|_p^2 + Y^2 n \ln\left( \frac{TX^2}{a} + 1 \right)$$

for $q \ge 2$. Since $\|\theta\|_2 \le n^{1/2-1/p} \|\theta\|_p$ if $p \ge 2$,
the upper bound becomes

$$L_T(\mathrm{AAR}) \le L_T(\theta) + a n^{1-2/p} \|\theta\|_p^2 + Y^2 n \ln\left( \frac{TX^2}{a} + 1 \right)$$

for $1 < q \le 2$. The last bound is better in $T$ but worse in $n$ than (3). In
our main theorem we consider spaces of infinite dimension. The role of $n$ is
played there by the dimension of the span of the inputs so far, which is
generally $T$, and only the bound similar to (3) remains nontrivial.

Many researchers in machine learning consider kernel methods. Some algo- 
rithms which use kernels are able to compete with functions from a Reproducing 
Kernel Hilbert space. Our abstract framework allows us to formulate the upper 
bound on the loss of an algorithm working in an abstract Hilbert space S = H. 
We denote the scalar product in $H$ by $\langle \cdot, \cdot \rangle$. The
algorithm which we use is called KAAR (Kernelized AAR). It takes a parameter
$a > 0$ and gives its prediction of an outcome at a step $T$ by the formula

$$\gamma_T = (y_1, \ldots, y_{T-1}, 0)(aI + K)^{-1} k(x_T). \qquad (4)$$

Here $I$ is the $T \times T$ identity matrix, $K$ is the matrix of mutual scalar
products $\langle x_i, x_j \rangle$, $x_i \in H$, $i, j = 1, \ldots, T$, and
$k(x_T)$ is the last column of $K$. It performs as well as any linear predictor
$h \in H$ (given an input $x$, a predictor $h$ predicts
$\langle h, x \rangle$). The following theoretical bound for KAAR follows from
Theorem 3 in Vovk (2006b).



Theorem (KAAR theoretical bound). For any $a > 0$, every positive integer $T$,
any sequence $(x_1, y_1), \ldots, (x_T, y_T)$, $x_t \in H$, $|y_t| \le Y$,
$t = 1, \ldots, T$, and all $h \in H$, the loss of KAAR satisfies

$$L_T(\mathrm{KAAR}) \le L_T(h) + a\|h\|_H^2 + Y^2 \ln\det\left( I + \frac{1}{a} K \right). \qquad (5)$$
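
The KAAR prediction can be sketched directly from its description: the vector
of past outcomes padded with a zero, multiplied by $(aI + K)^{-1}$ and by the
last column of the Gram matrix. The code below is an illustrative sketch under
that reading; `solve`, `kaar_predict`, and `dot` are hypothetical names, and the
plain dot product stands in for the scalar product of $H$.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def dot(u, v):
    """Scalar product used as the kernel in this sketch."""
    return sum(ui * vi for ui, vi in zip(u, v))


def kaar_predict(past_x, past_y, x_new, a, kernel):
    """Assumed KAAR rule: (y_1, ..., y_{T-1}, 0) (aI + K)^{-1} k(x_T)."""
    xs = list(past_x) + [x_new]
    T = len(xs)
    K = [[kernel(xs[i], xs[j]) for j in range(T)] for i in range(T)]
    M = [[K[i][j] + (a if i == j else 0.0) for j in range(T)] for i in range(T)]
    k_last = [K[i][T - 1] for i in range(T)]  # last column of K
    z = solve(M, k_last)
    y = list(past_y) + [0.0]
    return sum(y[i] * z[i] for i in range(T))
```

Only scalar products of signals enter the computation, which is what allows
the same rule to be reused once a suitable scalar product has been defined on
the span of the signals.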

3 Theoretical bound for the algorithm competing with Banach lattices

In this section we need to consider a different protocol than Protocol 1. The
learner plays the game following the semi-online Protocol 2.

Protocol 2 Semi-online abstract regression

  $L_0 := 0$.
  Reality announces the number of steps $T$ and signals $x_1, \ldots, x_T \in S$.
  for $t = 1, 2, \ldots, T$ do
    Learner announces $\gamma_t \in \mathbb{R}$.
    Reality announces $y_t \in [-Y, Y]$.
    $L_t := L_{t-1} + (y_t - \gamma_t)^2$.
  end for
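
The semi-online protocol can be sketched as a small driver loop: the learner is
constructed with all signals at once, and outcomes are then revealed one at a
time. The interface below (`predict`, `update`) and the baseline `MeanLearner`
are hypothetical illustrations, not part of the paper.

```python
class MeanLearner:
    """Hypothetical baseline: ignores signals, predicts the mean of past outcomes."""

    def __init__(self, signals):
        self.signals = signals  # all signals are available before step 1
        self.total = 0.0
        self.count = 0

    def predict(self, t):
        return self.total / self.count if self.count else 0.0

    def update(self, t, y):
        self.total += y
        self.count += 1


def run_semi_online(signals, outcomes, make_learner):
    """Play the semi-online game and return the cumulative square loss."""
    learner = make_learner(list(signals))  # Reality announces T and x_1..x_T
    loss = 0.0
    for t, y in enumerate(outcomes):
        gamma = learner.predict(t)         # Learner announces its prediction
        loss += (y - gamma) ** 2           # Reality announces y_t; loss accrues
        learner.update(t, y)
    return loss
```

Any learner that needs preprocessing over the whole set of signals can do it in
its constructor, which is exactly the freedom the semi-online setting grants.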



The learner competes with all the functions from the dual space $S^*$. The
algorithm BLAAR (Banach Lattices-competing AAR) working in $L_p(\mu)$, $p > 1$,
spaces is described as Algorithm 1 and derived in the Appendix. Recall that
$L_p(\mu)$ is the space of all $\mu$-equivalence classes of $p$-integrable
$\mu$-measurable functions on a $\mu$-measurable space $\mathbf{X}$:

$$\|f\|_{L_p(\mu)} = \left( \int_{\mathbf{X}} |f|^p \, d\mu \right)^{1/p} < \infty.$$

We use the notation $L_p = L_p(\mu)$.



Algorithm 1 BLAAR for $L_p$.

  Reality announces the number of steps $T$ and signals
  $x_1, \ldots, x_T \in L_p$.

  Step 1. Find the linearly independent subset of $x_1, \ldots, x_T$ with the
  maximum number of vectors: $x_{r_1}, \ldots, x_{r_n}$.

  Step 2. Solve the following optimization problem. Maximize the absolute
  value of the determinant of a matrix $C = \{c_{ij}\}_{ij}$ of size
  $n \times n$: $|\det C| \to \max$ with the restriction

  $$\left\| \left( \sum_{i=1}^n |\gamma_i|^2 \right)^{1/2} \right\|_{L_p} \le 1, \quad \text{where } \gamma_i = \sum_{j=1}^n c_{ij} x_{r_j}.$$

  Let the matrix $D$ be the inverse of the matrix $C$: $D = C^{-1}$.

  Step 3. Take $a = \sqrt{T} n^{-|1/2 - 1/p|}$, use it as a parameter for KAAR.

  for $t = 1, 2, \ldots, T$ do
    Let $x_s = \sum_{i=1}^n \alpha_{si} x_{r_i}$ for $s = 1, \ldots, T$. Apply
    KAAR for prediction at each step by formula (4). In the matrix of scalar
    products use

    $$K_{sl} = \frac{1}{n} \sum_{i,j=1}^n \alpha_{si} \alpha_{lj} \sum_{k=1}^n d_{ik} d_{jk}, \quad s, l = 1, \ldots, T. \qquad (6)$$
  end for



We prove the following upper bound for the cumulative loss of BLAAR. It
performs as well as any linear predictor $f \in (L_p)^*$ (given a signal $x$, a
predictor $f$ predicts $f(x)$).

Theorem 1. Suppose we are given $p > 1$ and $x_1, \ldots, x_T \in L_p$ for any
positive integer $T$. Assume also that $\|x_t\| \le X$ and $|y_t| \le Y$ for all
$t = 1, \ldots, T$. Then there exists $a > 0$ such that for all
$f \in (L_p)^*$ and any sequence $y_1, \ldots, y_T$ we have

$$L_T(\mathrm{BLAAR}(a)) \le L_T(f) + (Y^2X^2 + \|f\|^2) T^{1/2 + |1/2 - 1/p|}. \qquad (7)$$

The proof of this theorem is given in the Appendix. The main argument is
based on Corollary 5 in Lewis (1978). Note that if in Lemma 2 we take $n = T$,
then AAR gives the regret term of the same order $T^{1-1/p}$ for $\ell_p$,
$p \ge 2$.

It is possible to generalize the result for Banach lattices of more general
type. The algorithm becomes rather tricky because it is based on the complex
interpolation method, and we do not discuss it here. We formulate the theorem
for the theoretical bound of this algorithm. First we give the definition of a
Banach lattice (see Lindenstrauss and Tzafriri, 1979).

Definition 1. A Banach lattice is a partially ordered Banach space $B$ over the
reals provided

(i) $x \le y$ implies $x + z \le y + z$, for every $x, y, z \in B$,

(ii) $ax \ge 0$ for every $x \ge 0$ in $B$ and every nonnegative real $a$,

(iii) for all $x, y \in B$ there exists a least upper bound $x \vee y$ and a
greatest lower bound $x \wedge y$,

(iv) $\|x\| \le \|y\|$ whenever $|x| \le |y|$, where the absolute value $|x|$ of
$x \in B$ is defined by $|x| = x \vee (-x)$.

Banach lattices are a well-studied wide class of Banach spaces. For example,
any $L_p(\mu)$ is a lattice (consequently, $\ell_p$ is a lattice). Other
examples of Banach lattices are Orlicz spaces. Another, more intuitive
definition of a Banach lattice is given in Tomczak-Jaegermann (1989):

Definition 2. If $(\Omega, \Sigma, \mu)$ is a measure space then a Banach space
$B$ is called a Banach lattice on $(\Omega, \Sigma, \mu)$ if $B$ consists of
equivalence classes of $\mu$-measurable real functions on $\Omega$ such that if
$f$ is $\mu$-measurable, $g \in B$, and $|f| \le |g|$ $\mu$-a.e., then
$f \in B$ and $\|f\| \le \|g\|$.

We will use some pointwise expressions with elements of Banach lattices,
e.g., $z = (\sum_j |f_j|^p)^{1/p}$, $1 \le p < \infty$, where $\{f_j\}$ is a
finite sequence in $B$. Our main theorem uses the following structural
properties of Banach lattices. They are similar to convexity properties of
standard Banach spaces.



Definition 3. Assume $1 \le p, q \le \infty$ and $B$ is a Banach lattice.

(i) $B$ is called $p$-convex if there exists a constant $M$ so that for all $n$
and all $x_1, \ldots, x_n \in B$

$$\left\| \left( \sum_{i=1}^n |x_i|^p \right)^{1/p} \right\| \le M \left( \sum_{i=1}^n \|x_i\|^p \right)^{1/p}.$$

The smallest possible value of $M$ is denoted by $M^{(p)}(B)$.

(ii) $B$ is called $q$-concave if there exists a constant $M$ so that for all
$n$ and all $x_1, \ldots, x_n \in B$

$$\left( \sum_{i=1}^n \|x_i\|^q \right)^{1/q} \le M \left\| \left( \sum_{i=1}^n |x_i|^q \right)^{1/q} \right\|$$

(with the usual convention for $q = \infty$). The smallest possible value of
$M$ is denoted by $M_{(q)}(B)$.

Every Banach lattice is 1-convex and $\infty$-concave. As a non-trivial example,
the space $L_p(\mu)$ is a $p$-convex and $p$-concave Banach lattice with
$M^{(p)}(L_p) = M_{(p)}(L_p) = 1$ (this can be easily verified). If $p \ge 2$,
we can think of it as $p$-concave and 2-convex, and if $1 \le p \le 2$ we can
think of it as $p$-convex and 2-concave (see Lindenstrauss and Tzafriri, 1979,
Proposition 1.d.5).
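
The 2-convexity and 2-concavity of $L_p$ can be illustrated on a finite set
with counting measure. The sketch assumes the standard form of the two
inequalities with constant 1:
$\|(\sum_i |x_i|^2)^{1/2}\|_p \le (\sum_i \|x_i\|_p^2)^{1/2}$ for $p \ge 2$,
and the reverse inequality for $p \le 2$. The function names are hypothetical.

```python
import random


def lp_norm(f, p):
    """The l_p norm of a function on a finite set with counting measure."""
    return sum(abs(v) ** p for v in f) ** (1.0 / p)


def square_function(xs):
    """Pointwise (sum_i |x_i|^2)^(1/2) of a finite family of vectors."""
    return [sum(x[k] ** 2 for x in xs) ** 0.5 for k in range(len(xs[0]))]


def two_convexity_gap(xs, p):
    """(sum_i ||x_i||_p^2)^(1/2) minus ||(sum_i |x_i|^2)^(1/2)||_p.

    Nonnegative for p >= 2 (2-convexity), nonpositive for p <= 2
    (2-concavity), and zero for p = 2.
    """
    lhs = lp_norm(square_function(xs), p)
    rhs = sum(lp_norm(x, p) ** 2 for x in xs) ** 0.5
    return rhs - lhs
```

The sign change of the gap at $p = 2$ reflects the fact that the Hilbert case
is the only one in which the space is both 2-convex and 2-concave.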



Theorem 2. Let $B$ be a $p$-convex and $q$-concave Banach lattice,
$1 \le p \le 2 \le q \le \infty$. Suppose we are given
$x_1, \ldots, x_T \in B$ for any positive integer $T$. Assume also that
$\|x_t\| \le X$ and $|y_t| \le Y$ for all $t = 1, \ldots, T$. Then there exists
an algorithm taking some $a > 0$ such that for all $f \in B^*$ and any sequence
$y_1, \ldots, y_T$ we have

$$L_T(\mathrm{Algorithm}(a)) \le L_T(f) + (Y^2X^2 + \|f\|^2) M^{(p)}(B) M_{(q)}(B) T^{1/2 + \alpha}, \qquad (8)$$

where $\alpha = \max\left\{ \frac{1}{p} - \frac{1}{2}, \frac{1}{2} - \frac{1}{q} \right\}$.

The proof of this theorem is similar to the proof of Theorem 1. The main
argument is based on Theorem 28.6 in Tomczak-Jaegermann (1989) or Corollary
i.b in Pisier (1979), though their proof techniques are different from the proof
technique of the main argument for Theorem 1. The sequence of steps in the
algorithm follows the steps of the proof of these theorems.



4 Applications 

In this section we consider different applications of our main theorem. They use
Theorem 1 rather than Theorem 2, so the algorithm used to give predictions is
Algorithm 1.

4.1 Algorithm competing with functional Banach spaces 

A different protocol than Protocol 2 is usually considered in the online
regression literature: inputs are elements of some domain
$\mathbf{X} \subset \mathbb{R}^m$. The goal is to find an algorithm competing
with all the functions from a functional Banach space $B$ on this domain
$\mathbf{X}$. Many algorithms are capable of competing with Reproducing Kernel
Hilbert spaces. The generalization of the notion of these spaces to the
Banach case is called a Proper Banach Functional space (Vovk, 2007).



Definition 4. A Proper Banach Functional space (PBFS) on a set $\mathbf{X}$ is a
Banach space $B$ of real-valued functions on $\mathbf{X}$ such that the
evaluation functional $f \in B \mapsto f(x)$ is continuous for each
$x \in \mathbf{X}$. We will use the notation $c_B(x)$ for the norm of this
functional: $c_B(x) := \sup_{f : \|f\|_B \le 1} |f(x)|$, and for the embedding
constant

$$c_B := \sup_{x \in \mathbf{X}} c_B(x),$$

assumed to be finite.

We will show further examples of PBFS with finite constant $c_B$. We state
here that it is possible to apply BLAAR to get the following upper bound in
the standard protocol. It performs as well as any predictor $f$ from a Banach
lattice which has the PBFS property (given an input vector
$x \in \mathbb{R}^m$, a predictor $f$ predicts $f(x)$).

Theorem 3. Let $\mathbf{X}$ be an arbitrary set and $B$ be a PBFS on
$\mathbf{X}$ and a $q$-convex and $p$-concave Banach lattice,
$1 \le q \le 2 \le p \le \infty$. Suppose we are given
$x_1, \ldots, x_T \in \mathbf{X}$ for any positive integer $T$. Assume also that
$|y_t| \le Y$ for all $t = 1, \ldots, T$. Then it is possible to apply BLAAR
with a parameter $a > 0$ such that for all $f \in B$ and any sequence
$y_1, \ldots, y_T$ we have

$$L_T(\mathrm{BLAAR}) \le L_T(f) + (Y^2 c_B^2 + \|f\|^2) M_{(p)}(B) M^{(q)}(B) T^{1/2 + \beta}, \qquad (9)$$

where $\beta = \max\left\{ \frac{1}{q} - \frac{1}{2}, \frac{1}{2} - \frac{1}{p} \right\}$.

The proof of this theorem is based on the correspondence between $\mathbf{X}$
and $B^{**}$ and on Theorem 2, and almost fully repeats the proof of
Corollary 1 (which follows).

The regret term in (9) reaches its minimum in $p, q$ when $p = q = 2$. In this
case B is a Hilbert space. The PBFS property implies that B is a Reproducing 
Kernel Hilbert Space. In this case, the regret term is of order T 1 / 2 and coincides 
with the order of the regret terms given by the algorithms previously applied 
for competing with RKHS. 

We cannot apply Theorem 3 to $L_p$ spaces since they are not proper. But
this theorem covers very important classes of Banach spaces: Besov and
Triebel-Lizorkin spaces with appropriate parameters (Triebel, 1978). We start
our description with the discussion of the algorithm competing with fractional
Sobolev spaces.

The main trick used in order to compete with Sobolev spaces is to identify
each element of them with some element from $L_p$ of the same (up to a constant)
norm and thus to impose a lattice structure on these spaces. This isomorphism
can be found if $\mathbf{X}$ is an open, non-empty subset of $\mathbb{R}^m$ such
that there exists a linear extension operator (see the definition on p. 1372 of
Pełczyński and Wojciechowski, 2003) from a Sobolev space $W_p^s(\mathbf{X})$
into $W_p^s(\mathbb{R}^m)$. This condition holds for Lipschitz domains (see,
e.g., Rogers, 2006). We will further assume our domain is a Lipschitz domain.

Let us take a function $u(x) : \mathbb{R}^m \to \mathbb{R}$ and by $\hat{u}$
denote the Fourier transform of $u(x)$. By $f^\vee$ we denote the inverse
Fourier transform

$$f^\vee(x) = \frac{1}{(2\pi)^{m/2}} \int_{\mathbb{R}^m} f(y) e^{i x \cdot y} \, dy$$

for a function $f$. Then the isomorphism between a Sobolev space and a subspace
of $L_p$ is described by the following theorem. It is constructed using Bessel
potentials.

Theorem (Isomorphism of $W_p^s$ and $L_p$). Let $1 < p < \infty$, $s > 0$ such
that $sp > m$. Then $W_p^s$ may be described as

$$W_p^s = \{ f \in S'(\mathbb{R}^m) : ((1 + \|y\|_2^2)^{s/2} \hat{f})^\vee \in L_p(\mathbb{R}^m) \}, \qquad (10)$$

where $S'(\mathbb{R}^m)$ is the collection of all tempered distributions on
$\mathbb{R}^m$.



The mapping in (10) means the convolution of a given function $f$ and the
function with a polynomial Fourier transform $(1 + \|y\|_2^2)^{s/2}$ (the latter
called the Bessel potential). For explicit expressions of these functions see
Aronszajn et al. (1963).

For the Sobolev spaces with $p = 2$ and integer $s$, the proof is easy and
based on Plancherel's theorem of norm equivalence. For all $s, p$ see
Triebel (1992), Theorem 1.3.2.



We can apply this theorem to get an algorithm competing with Sobolev
spaces. This algorithm can be derived from the proof of the following corollary.

Corollary 1. Assume $\mathbf{X}$ is a Lipschitz domain, and
$W_p^s(\mathbf{X})$ is a fractional Sobolev space of functions on
$\mathbf{X}$, $s > 0$, $p > 1$. Suppose we are given
$x_1, \ldots, x_T \in \mathbf{X}$ for any positive integer $T$. Assume also that
$|y_t| \le Y$ for all $t = 1, \ldots, T$. Then there exists an algorithm taking
some $a > 0$ such that for all $f \in W_p^s(\mathbf{X})$ and any sequence
$y_1, \ldots, y_T$ we have

$$L_T(\mathrm{BLAAR}) \le L_T(f) + (Y^2 c_{W_p^s}^2 + \|f\|_{W_p^s}^2) K T^{1/2 + |1/2 - 1/p|}.$$

Here $K$ is defined by the isomorphism between $W_p^s$ and $L_p$.
The proof of this corollary is given in the Appendix.

Recently, Besov $B^s_{pq}$ and Triebel-Lizorkin $F^s_{pq}$ function spaces have
begun to interest researchers due to their connections with wavelet theory.
They have the PBFS property (see Triebel, 2005, Proposition 7(h)), and
$c_{B^s_{pq}}, c_{F^s_{pq}} < \infty$. By the embedding theorem (Triebel, 1978,
Theorem 2.3) $F^s_{pq} \to B^s_{pq} \to F^{s'}_{p2} = W^{s'}_p$,
$1 < p \le q < \infty$, $s > s'$. Here an embedding $A \to B$ means there exists
a constant $C$ and a linear operator $T : A \to B$ such that for any $f \in A$
we have $Tf \in B$ and $\|Tf\|_B \le C\|f\|_A$. For Slobodetsky spaces
$B^s_p = B^s_{pp}$ we can use another result from the same theorem:
$B^s_p \to W^s_p$, $2 \le p < \infty$, $s > 0$. It helps to keep the parameter
$s$ and thus not to increase constants in the regret term. Using Corollary 1 we
can get the theoretical bound

$$L_T(\mathrm{BLAAR}) \le L_T(f) + (Y^2 c^2_{\{B,F\}^s_{pq}} + \|f\|^2_{\{B,F\}^s_{pq}}) C T^{1/2 + |1/2 - 1/p|}$$

for all $f \in \{B, F\}^s_{pq}$, $p > 1$, and some $C > 0$ defined as the
product of the embedding constants. An important benchmark class is the class of
Hölder-Zygmund spaces $C^s = B^s_{\infty\infty}$, which is embedded into
$B^{s'}_p = B^{s'}_{pp}$ whenever $s' < s$. It is known that fractional Brownian
motion $B^{(h)}$ almost surely belongs to $C^s$, $s < h$.

4.2 Application of the abstract framework

In this section we describe an example how our algorithm can be used in signal 
processing. A signal can often be interpreted as a function on some domain, 
e.g., a picture can be thought of as a mapping from points to colors. A musical 
fragment can be thought of as a mapping from a point in time into sound 
frequencies. We may be given weak regularity restrictions on the class these 
functions form, e.g., it can be a Sobolev or Besov space. The family of Hilbert 
spaces is reasonably wide, but it lacks many classes of functions of irregular
behavior. 

Imagine we are given a film consisting of frames of resolution 1024 x 768 and
we want to predict some score calculated from each image. The correct linear
score for each image is given to us only after we make a prediction about the
score of this image. Applying the algorithm from Lemma 2 for prediction we
can get the following upper bound for the square loss of our predictions

$$L_T(\mathrm{AAR}) \le (Y^2X^2 + \|\theta\|_p^2) T^{1/2} n^{1/2 - 1/p},$$

where $p > 2$, $n = 1024 \times 768 = 786432$, $T$ is the length of the film in
frames, $X$ is the maximal $q$-norm of images ($1/q + 1/p = 1$), and $Y$ is the
upper bound on the absolute value of the score. The upper bound from the remark
is worse in $n$, and in our example $n$ is the dominating constant for
reasonable film lengths. On the other hand, the algorithm from Theorem 1 has the
upper bound

$$L_T(\mathrm{BLAAR}) \le (Y^2X^2 + \|\theta\|^2) T^{1 - 1/p}.$$

Then if we want to predict 24 frames per second (say, to detect defective
frames), the upper bound on the loss of the second algorithm will be better if
we work with films of fewer than $n$ frames, that is, of duration less than
32768 seconds (around 9 hours). The higher the resolution of the images is, the
more advantage the second algorithm has. This improvement is due to the fact
that it finds linearly independent vectors and significantly depends only on
them. Note that the example above works well in the semi-online setting.
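
The break-even point in this example follows from comparing the rates
$T^{1/2} n^{1/2-1/p}$ and $T^{1-1/p}$: for $p > 2$ the second is smaller
exactly when $T < n$. The short sketch below reproduces the arithmetic; the
function names are hypothetical.

```python
def aar_rate(T, n, p):
    """Rate T^(1/2) * n^(1/2 - 1/p) of the finite-dimensional bound."""
    return T ** 0.5 * n ** (0.5 - 1.0 / p)


def blaar_rate(T, p):
    """Rate T^(1 - 1/p) of the lattice bound."""
    return T ** (1.0 - 1.0 / p)


def break_even(width, height, fps):
    """Return (n, seconds, hours) at which the two rates coincide (T = n)."""
    n = width * height
    seconds = n / fps
    return n, seconds, seconds / 3600.0
```

For the film example, `break_even(1024, 768, 24)` gives $n = 786432$ frames,
which corresponds to 32768 seconds of film at 24 frames per second.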



4.3 Learning a classifier 

Online regression algorithms are often applied in the batch setting, when one 
has a training set with input vectors and their labels and a test set containing 
just input vectors. In this case the semi-online setting does not appear as a 
drawback. 

Online regression methods can be used to learn a linear classifier, for example
the Perceptron. Cesa-Bianchi et al. (2005) use the AAR algorithm steps to make
an algorithm to train a Perceptron and to derive upper bounds on the number of
mistakes. They consider both linear classification and classification in an
RKHS. We show that the combination of our preprocessing steps and their
algorithm allows us to learn a classifier working in a PBFS. The abstract
protocol may be considered here, but we describe the standard protocol to give
the reader a better understanding of how our algorithms can be applied to
classify vectors. Let $(x_1, y_1), \ldots, (x_T, y_T)$ be a set of examples,
where $x_i \in \mathbb{R}^m$ is an input vector and $y_i \in \{-1, 1\}$ is its
label, $i = 1, \ldots, T$. The label corresponds to the class of the input
vector. Define the hinge loss
$D_\gamma(f, (x, y)) = \max\{0, \gamma - y f(x)\}$, $\gamma > 0$, of any
function $f$ from a Sobolev space $W_p^s$, $s > 0$, $p > 1$. If we make the
preprocessing steps described in the proof of Corollary 1 we will get vectors
$r_1, \ldots, r_T \in \mathbb{R}^n$ corresponding to our input vectors, for some
$n \le T$. At step $t$ the second-order Perceptron Algorithm (see Cesa-Bianchi
et al., 2005, Figure 3.1) predicts

$$\hat{y}_t = \mathrm{sign}\left( \left( \sum_{i \in M_t} y_i r_i \right)' \left( aI + \sum_{i \in M_t} r_i r_i' + r_t r_t' \right)^{-1} r_t \right),$$

where $M_t \subseteq \{1, 2, \ldots\}$ is the set of indices of mistaken trials
($\hat{y}_i \ne y_i$, $i \in M_t$) before the step $t$. It is possible to prove
the following upper bound on the number of mistakes.
before the step t. It is possible to prove the following upper bound on the 
number of mistakes 

Theorem 4. It is possible to run the second-order Perceptron Algorithm on
any finite sequence $(x_1, y_1), \ldots, (x_T, y_T)$ of examples such that the
number $k$ of mistakes satisfies

$$k \le \inf_{\gamma > 0} \min_{f \in W_p^s} \left( \frac{R(f,T,a)^2}{2\gamma^2} + \frac{D_\gamma(f)}{\gamma} + \frac{R(f,T,a)}{\gamma} \sqrt{\frac{D_\gamma(f)}{\gamma} + \frac{R(f,T,a)^2}{4\gamma^2}} \right),$$

where R 2 (f,T,a) = ^(T^/Pyf^ f(^f) and D y (f) is the 

cumulative hinge loss $\sum_{i=1}^{T} D_\gamma(f, (x_i, y_i))$ of $f$.

Note that $T^{1/2 - 1/p}$ in the theorem above is equal to the maximum number
of linearly independent inputs (converted to the dual space
$(W_p^s)^{**}$), so if the algorithm is run on the same sequence of inputs
several times then this number remains the same.
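
The hinge loss and the algebra of the mistake bound can be sketched as follows.
The sketch assumes the hinge loss $\max\{0, \gamma - y f(x)\}$ and a bound of the
three-term form
$R^2/(2\gamma^2) + D/\gamma + (R/\gamma)\sqrt{D/\gamma + R^2/(4\gamma^2)}$,
where the values $R$ and $D$ are simply passed in as numbers. The function
names are hypothetical.

```python
import math


def hinge_loss(f_x, y, gamma):
    """Hinge loss max(0, gamma - y * f(x)) at margin gamma > 0."""
    return max(0.0, gamma - y * f_x)


def mistake_bound(R, D, gamma):
    """Three-term expression in R (complexity term) and D (cumulative hinge loss)."""
    r = R / gamma
    d = D / gamma
    return r * r / 2.0 + d + r * math.sqrt(d + r * r / 4.0)
```

A useful sanity check is that this expression is the largest $k$ satisfying
$k \le D/\gamma + (R/\gamma)\sqrt{k}$; in the separable case $D = 0$ it reduces
to $R^2/\gamma^2$.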



5 Discussion 



The idea of competing with Banach spaces is not new. Vovk (2006a, 2007)
considers two different ways to do this. The first technique is based on the
game-theoretic probability theory (Shafer and Vovk, 2001) and is called
Defensive Forecasting. The second technique is based on the metric entropy of
the space with which the learner wishes to compete. The Aggregating Algorithm
is used for prediction. Suppose that input vectors are taken from a domain
$\mathbf{X} \subset \mathbb{R}^m$. The main difference in the theoretical bounds
for the two algorithms can be described by an example of Slobodetsky spaces
$B^s_p(\mathbf{X}) = B^s_{pp}(\mathbf{X})$. We always assume that $sp > m$: this
condition ensures that the elements of $B^s_p(\mathbf{X})$ are continuous
functions on $\mathbf{X}$ (see, e.g., Triebel, 1978). Assuming $p \ge 2$, the
known upper bound on the regret term is of order $O(T^{1-1/p})$ when the learner
uses either Defensive Forecasting or our algorithm from Corollary 1 to predict
the outcomes. This order does not depend on $s$. The order $O(T^{m/(m+s)})$ is
provided by the Metric Entropy technique. This order does not depend on $p$ and
so this algorithm can be applied to compete with spaces with $p = 1$.

The question asked by Vovk (2006a) is whether it is possible to create an
algorithm which will involve both $p$ and $s$ parameters in the order of $T$ in
the regret term. Our paper gives another way to apply the Aggregating Algorithm
and the order of the regret term corresponds to the order given by Defensive
Forecasting: $T^{1-1/p}$. Thus we reduced the problem to the analysis of two
different ways of using the Aggregating Algorithm to mix functions from Banach
spaces.

Our paper shows the same order of the regret term in $T$ as the Defensive
Forecasting method. This suggests that this order may be optimal in $p$. The
lack of lower bounds for methods capable of competing with functions from
Banach spaces does not allow us to make a strong argument.

Appendix

Proof of Theorem 1

The proofs of Theorems 1 and 2 are based on the possibility of constructing an
isomorphism between a finite-dimensional Banach lattice and a Hilbert space such
that the norms of vectors do not increase too much. Precisely (see
Tomczak-Jaegermann, 1989, Theorem 28.6),



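Theorem (Distance). Let $B$ be a $p$-convex and $q$-concave Banach lattice, and
$\mathcal{X}$ be an $n$-dimensional subspace of $B$. If
$1 \le p \le 2 \le q \le \infty$ then there exists an isomorphic operator
$U : \mathcal{X} \to \ell_2^n$ such that

$$\inf \|U\| \|U^{-1}\| \le n^{\alpha} M^{(p)}(B) M_{(q)}(B). \qquad (11)$$

Here $\alpha = \max\left\{ \frac{1}{p} - \frac{1}{2}, \frac{1}{2} - \frac{1}{q} \right\}$,
$\|U\| = \sup_{x \in \mathcal{X}} \frac{\|U(x)\|_2}{\|x\|_B}$,
$\|U^{-1}\| = \sup_{r \in \ell_2^n} \frac{\|U^{-1}(r)\|_B}{\|r\|_2}$,
and the infimum is taken over all isomorphisms between $\mathcal{X}$ and
$\ell_2^n$.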



This theorem is formulated for $L_p$, $1 < p < \infty$, spaces as Corollary 5 in
Lewis (1978). The infimum in this case is bounded by $n^{|1/2 - 1/p|}$. The
expression $\inf_U \|U\| \|U^{-1}\|$ defines the Banach-Mazur distance
$d(\mathcal{X}, \ell_2^n)$ between $\mathcal{X}$ and $\ell_2^n$. For the case of
a general Banach space the John theorem (John, 1948) states that
$d(\mathcal{X}, \ell_2^n) \le \sqrt{n}$ for any $n$-dimensional subspace
$\mathcal{X} \subset U$, where $U$ is some Banach space. Thus our theorems can
be applied for the cases $p = 1$ and $p = \infty$, though the regret term
becomes trivial (of order $T$).

Remark. Interestingly, if one wants to omit the lattice structure on a Banach
space, it is possible to prove a weaker result than the Distance theorem. For
the definitions and the result we refer to Proposition 27.4 in
Tomczak-Jaegermann (1989):

If $B$ is a Banach space of type $p$ and cotype $q$,
$1 \le p \le 2 \le q < \infty$, and $\mathcal{X}$ is an $n$-dimensional subspace
of $B$, then $d(\mathcal{X}, \ell_2^n) \le C n^{1/p - 1/q}$ for some constant
$C$. Clearly, this bound is worse than $n^{\alpha}$, but it can be shown that
this is the best possible asymptotic estimate for arbitrary $p, q$ provided
$1/p - 1/q < 1/2$. In the case $1/p - 1/q \ge 1/2$ this proposition does not
give anything new due to the fact that $\sqrt{n}$ is the maximal bound.

Proof of Theorem 1. Let $\mathcal{X} \subset L_p$ be the linear span of the
input vectors $x_1, \ldots, x_T$. Let the dimension of $\mathcal{X}$ be $n$. By
the Distance theorem there exists an isomorphism
$U : \mathcal{X} \to \ell_2^n = \mathbb{R}^n$ such that

$$\|U\| \|U^{-1}\| \le n^{|1/2 - 1/p|} = C.$$

Let the norms of the operator $U$ and of the inverse operator $U^{-1}$ be

$$\|U\| = \sup_{x \in \mathcal{X}} \frac{\|U(x)\|_2}{\|x\|} = 1, \qquad \|U^{-1}\| = \sup_{r \in \mathbb{R}^n} \frac{\|U^{-1}(r)\|}{\|r\|_2} \le C. \qquad (12)$$

Here we state that the norm of the direct operator equals one, since otherwise
we can replace the operator $U$ by the operator $V = U / \|U\|$ with unit norm.
The norm of the inverse operator then increases by a factor of $\|U\|$.

By $r_i = U(x_i)$, $i = 1, \ldots, T$, we denote the images of the input vectors
$x_i$ under the operator $U$. We apply KAAR with the scalar product kernel to
these images $r_i$ consecutively. By formula (5) we get that for any
$g \in (\mathbb{R}^n)^*$ and $a > 0$ the loss of the learner at a step $T$
satisfies

$$L_T(\mathrm{BLAAR}) \le L_T(g) + a\|g\|_2^2 + \frac{Y^2 \sum_{t=1}^{T} \|r_t\|_2^2}{a}, \qquad (13)$$

where the determinant of the positive definite matrix $I + \frac{1}{a} K$ ($K$
is the matrix of scalar products $\langle r_i, r_j \rangle$) is bounded by the
product of its diagonal elements (Beckenbach and Bellman, 1961, Chapter 2,
Theorem 7) and the logarithm $\ln(1 + x)$ is bounded by $x$ for $x \ge 0$.

For any $f \in B^*$ we take $g : \mathbb{R}^n \to \mathbb{R}$ such that $g(r) := f(U^{-1}(r))$, $\forall r \in \mathbb{R}^n$. 
Since $U$ is an isomorphism, $U^{-1}(r) \in L_p$. The linearity of $g$ follows from the 
linearity of $U^{-1}$, so $g \in (\mathbb{R}^n)^*$. This means that $L_T(g) = L_T(f)$ because the 
values of $f$ and $g$ are equal on the corresponding vectors from $X$ and $\mathbb{R}^n$. 

Let us consider the linear functional $h : X \to \mathbb{R}$ such that $h(x) = f(x)$, 
$\forall x \in X$. Clearly, 

$$\|f\|_B = \sup_{\|x\|=1,\, x \in B} |f(x)| \ge \sup_{\|x\|=1,\, x \in X} |f(x)| = \sup_{\|x\|=1,\, x \in X} |h(x)| = \|h\|_X.$$

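Then the norm of $h$ can be estimated from below using (12): 

$$\|h\|_X = \sup_{x \in X} \frac{|h(x)|}{\|x\|} = \sup_{r \in \mathbb{R}^n} \frac{|g(r)|}{\|U^{-1}r\|} \ge \frac{1}{C} \sup_{r \in \mathbb{R}^n} \frac{|g(r)|}{\|r\|} = \frac{1}{C}\|g\|.$$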

On the other hand, we have $\|r\| \le \|x\|$ for all $x \in X$, $r = U(x)$. Thus 
$\|r_i\|^2 \le \|x_i\|^2$, $i = 1, \dots, T$, and $\|g\|^2 \le C^2\|h\|^2 \le C^2\|f\|^2$, so the 
theoretical bound transforms to 

$$L_T(\text{BLAAR}) \le L_T(f) + aC^2\|f\|^2 + \frac{Y^2 T X^2}{a}.$$

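We can choose $a = \sqrt{T}/C$, which turns the last two terms into 
$C\sqrt{T}\,(\|f\|^2 + X^2Y^2)$, and recall that $n$ in $C$ is the number of linearly 
independent input vectors among $x_1, \dots, x_T$. Thus $n \le T$, and we get the 
bound of the theorem. □ 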
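
The effect of this choice of $a$ can be checked numerically. The sketch below assumes that $X$ bounds the norms of the inputs and $Y$ bounds the outcomes, as the form of the bound suggests; the function names and the sample values are hypothetical.

```python
import math


def distance_constant(n, p):
    """C = n^{|1/2 - 1/p|} from the Distance theorem for L_p."""
    return n ** abs(0.5 - 1.0 / p)


def regret_term(a, C, f_norm, X, Y, T):
    """a C^2 ||f||^2 + Y^2 T X^2 / a, the extra terms in the bound."""
    return a * C ** 2 * f_norm ** 2 + Y ** 2 * T * X ** 2 / a


def regret_at_suggested_a(C, f_norm, X, Y, T):
    """Evaluate the regret term at a = sqrt(T) / C."""
    a = math.sqrt(T) / C
    return regret_term(a, C, f_norm, X, Y, T)


# Illustrative values only: T steps, n independent inputs, space L_4.
T, n, p = 400, 25, 4.0
C = distance_constant(n, p)
value = regret_at_suggested_a(C, f_norm=1.5, X=2.0, Y=1.0, T=T)

# Closed form C * sqrt(T) * (||f||^2 + X^2 Y^2) for comparison.
closed_form = C * math.sqrt(T) * (1.5 ** 2 + (2.0 * 1.0) ** 2)
```

Since $n \le T$, the constant `C` is at most `distance_constant(T, p)`, so the regret term grows as $T^{1/2 + |1/2 - 1/p|}$ up to the factor $\|f\|^2 + X^2Y^2$.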

Derivation of Algorithm 1 

The derivation of our algorithm is based on the proof of the Distance theorem 
for $B = L_p$ given in Lewis (1978). 



Step 2: The optimization task. Let $\operatorname{span}\{x_{1r}, \dots, x_{Tr}\} = Z$, $\dim Z = n$ 
(we will further omit the index $r$ of the $x$-es). We take some basis $\phi_1, \dots, \phi_n \in Z^*$. 
For an isomorphism $u : \ell_2^n = \mathbb{R}^n \to Z \subset L_p$ we define its determinant by 
$\det u = \det\{\phi_i(\gamma_j)\}_{ij}$, where $\gamma_i = u(e_i)$ for the unit vector basis $e_i \in \mathbb{R}^n$. Then 
we find $u$ such that $|\det u| \to \max$ subject to $\|\gamma_z\|_p \le 1$, where $\gamma_z = \sqrt{\sum_{i=1}^n |\gamma_i|^2}$ 
(the resulting isomorphism is the one where the infimum is attained in the 
Distance theorem). The maximum determinant is unique up to a constant 
depending on the choice of the $\phi_i$. It is convenient to choose $\phi_i$ such that $\phi_i(x) = a_i$ for 
any $x = \sum_{i=1}^n a_i x_i \in Z$ (so $\{\phi_i\}_1^n$ is a biorthogonal system to $\{x_i\}_1^n$). Then 
$|\det u| = |\det\{c_{ij}\}_{ij}|$ for $\gamma_i = \sum_j c_{ij} x_j$. 

Step 3: The scalar product. The scalar product $(\cdot, \cdot)$ is calculated by 

$$(x_i, x_j) = \int_\Omega x_i x_j |\gamma_z|^{p-2} \, dx, \quad i, j = 1, \dots, T,$$

and is equivalent to (6) because $\delta_{ij} = n \int_\Omega \gamma_i \gamma_j |\gamma_z|^{p-2} \, dx$, $i, j = 1, \dots, n$, from 
the proof of Theorem 1 in Lewis (1978). 
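
Once the weight function is available, the scalar products of Step 3 reduce to weighted integrals. The following is a minimal sketch, assuming $\Omega = [0, 1]$ with Lebesgue measure, a midpoint quadrature rule, and a weight function `gamma_z` supplied by the caller (the optimization of Step 2 is not implemented here); all names are hypothetical.

```python
def weighted_gram(inputs, gamma_z, p, m=1000):
    """Approximate K[i][j] = int_0^1 x_i x_j |gamma_z|^(p-2) dx.

    inputs  -- list of callables x_i(t) on [0, 1]
    gamma_z -- callable weight function; must not vanish when p < 2
    p       -- exponent of the space L_p
    m       -- number of midpoint-rule nodes
    """
    grid = [(k + 0.5) / m for k in range(m)]
    weights = [abs(gamma_z(t)) ** (p - 2.0) for t in grid]
    values = [[x(t) for t in grid] for x in inputs]
    T = len(inputs)
    return [[sum(vi * vj * w
                 for vi, vj, w in zip(values[i], values[j], weights)) / m
             for j in range(T)]
            for i in range(T)]


# Two toy inputs and a toy weight; assumptions for illustration only.
inputs = [lambda t: 1.0, lambda t: t]
gamma = lambda t: 1.0 + t

K_p2 = weighted_gram(inputs, gamma, p=2.0)   # weight is identically 1
K_p3 = weighted_gram(inputs, gamma, p=3.0)   # weight is 1 + t
```

For $p = 2$ the weight disappears and the matrix is the ordinary $L_2$ Gram matrix, which gives a quick sanity check of the quadrature.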



Proof of Corollary 1 

Sobolev spaces are proper Banach spaces (see Triebel, 2005, Proposition 7(h)). 



The identity mapping from them to $C(X)$ is bounded, so $c_{\mathcal{F}} < \infty$. On the 
other hand, we impose a lattice structure on the Sobolev space using its 
isomorphism with $L_p$. 

Proof. We represent $x_1, \dots, x_T$ as elements $\alpha_i$ of the dual space $(W_p^s)^*$. We take 
$\alpha_i(f) = \alpha_{x_i}(f) := f(x_i)$, $\forall f \in W_p^s$, $i = 1, \dots, T$. In this setting we compete 
with the elements of $(W_p^s)^{**}$: for each $f \in W_p^s$ we take $g_f \in (W_p^s)^{**}$ such that 
by definition $g_f(\alpha) := \alpha(f)$ for any $\alpha \in (W_p^s)^*$. This change of variables does 
not change the prediction error, since $f(x_i) = \alpha_i(f) = g_f(\alpha_i)$. 

The isomorphism theorem states that there exists a linear isomorphism 
$U : L_p \to W_p^s$ between $L_p$ and $W_p^s$ such that $\|U\| \, \|U^{-1}\| \le K$ for some 
constant $K$. We also denote 

$$C_U = \|U\| = \sup_{\eta \in L_p} \frac{\|U\eta\|}{\|\eta\|}, \qquad C_{U^{-1}} = \|U^{-1}\| = \sup_{f \in W_p^s} \frac{\|U^{-1}f\|}{\|f\|}.$$

This isomorphism defines a dual isomorphism $U^* : (W_p^s)^* \to (L_p)^*$ by 

$$(U^*\alpha)(\eta) = \alpha(U\eta),$$

where $\eta \in L_p$ and $\alpha \in (W_p^s)^*$. Clearly, $(U^*\alpha)(U^{-1}f) = \alpha(f)$, $\forall f \in W_p^s$. We 
denote $\beta = U^*\alpha \in (L_p)^*$. Similarly, we have a correspondence 
$U^{**} : (W_p^s)^{**} \to (L_p)^{**}$ defined by 

$$(U^{**}g)(\beta) = g((U^*)^{-1}\beta),$$

which gives us the functions $h = U^{**}g \in (L_p)^{**}$ to compete with. After these 
replacements we have the same prediction error, since for any $g \in (W_p^s)^{**}$ we 
get $h(\beta_i) = (U^{**}g)(\beta_i) = g((U^*)^{-1}\beta_i) = g(\alpha_i)$, $i = 1, \dots, T$. The norm of a 
function increases by at most a constant factor: 

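$$\|h\| = \sup_{\beta} \frac{|h(\beta)|}{\|\beta\|} = \sup_{\alpha} \frac{|g(\alpha)|}{\|U^*\alpha\|} \le \sup_{\alpha} \frac{|g(\alpha)| \, \|\eta\|}{|\alpha(U\eta)|} = \|\eta\| \le C_{U^{-1}} \|f\|,$$

where $g = g_f$ and $\eta = U^{-1}f$. The first inequality follows from the fact that 
$\|U^*\alpha\| = \sup_{\eta' \in L_p} \frac{|(U^*\alpha)(\eta')|}{\|\eta'\|} \ge \frac{|(U^*\alpha)(\eta)|}{\|\eta\|} = \frac{|\alpha(U\eta)|}{\|\eta\|}$, $\forall \eta \in L_p$. To apply Theorem 1 we have to 
ensure that $\|\beta_i\|$ is bounded. It holds since 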

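$$\|\beta_i\| = \sup_{\eta \in L_p} \frac{|\beta_i(\eta)|}{\|\eta\|} = \sup_{f \in W_p^s} \frac{|f(x_i)|}{\|U^{-1}f\|} \le \sup_{f \in W_p^s} \frac{|f(x_i)| \, C_U}{\|f\|} \le C_U c_{\mathcal{F}}, \quad i = 1, \dots, T.$$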

Applying Theorem 1 concludes the proof. □ 
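
The change of variables used in this proof can be checked in a finite-dimensional toy model. The sketch below assumes that both spaces are $\mathbb{R}^2$ and that functionals are represented by coordinate vectors acting through the dot product, so the dual map $U^*$ becomes the transposed matrix; this is an illustration only, not the Sobolev setting of the corollary, and all names and values are hypothetical.

```python
def dot(u, v):
    """Action of a functional u on a vector v (both as coordinate lists)."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    """Matrix-vector product."""
    return [dot(row, v) for row in M]


def transpose(M):
    """Matrix transpose; represents the dual operator U*."""
    return [list(col) for col in zip(*M)]


def inverse_2x2(M):
    """Inverse of an invertible 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


U = [[2.0, 1.0], [0.0, 1.0]]          # toy isomorphism from "L_p" to "W"
f = [3.0, -1.0]                        # toy predictor f in "W"
alphas = [[1.0, 0.0], [1.0, 2.0]]      # toy evaluation functionals alpha_i

eta = mat_vec(inverse_2x2(U), f)                        # eta = U^{-1} f
betas = [mat_vec(transpose(U), alpha) for alpha in alphas]  # beta_i = U* alpha_i

original = [dot(alpha, f) for alpha in alphas]      # alpha_i(f) = f(x_i)
transformed = [dot(beta, eta) for beta in betas]    # beta_i(U^{-1} f)
```

The two lists coincide, which mirrors the identity $(U^*\alpha)(U^{-1}f) = \alpha(f)$: the predictions, and hence the losses, are the same before and after the change of variables.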